#Loading the data set and preparing it for the analysis. The required packages were loaded.
suppressMessages(library("DESeq2"))
suppressMessages(library("tidyverse"))
suppressMessages(library("recount"))
suppressMessages(library("ggpubr"))
suppressMessages(library("EnhancedVolcano"))
suppressMessages(library("EnsDb.Hsapiens.v86"))
suppressMessages(library("msigdbr"))
suppressMessages(library("clusterProfiler"))
suppressMessages(library("PoiClaClu"))
suppressMessages(library("pheatmap"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("org.Hs.eg.db"))
suppressMessages(library("DT"))
For this project, I wanted to use RNAseq dataset to compare treated and untreated conditions. So I created a project info to search for treated control models into recount database. I found the SRP065849 fit my description, so I downloaded count data of it.
project_info <- abstract_search(query = "treated control")
View(project_info)
selected_study <- "SRP065849"
download_study(selected_study)
The count data was loaded. The read count matrix and sample metadata was checked if they are ready for analysis. I noticed the condition data was not present, so I have to add it manually. I obtained the condition details from the link below and added as a condition column to the rse_gene dataframe.
From here the condition details were obtained for this experiment. https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP065849&o=acc_s%3Aa
I also noticed that the gene IDs were not in correct format to search, so the gene ids were corrected by deletin the dots present in the IDs.
load("SRP065849/rse_gene.Rdata")
cts <- assay(rse_gene)
View(cts)
cd <- as.data.frame(colData(rse_gene))
View(cd)
rse_gene$condition <- c(rep("treated",5), rep("untreated", 4))
cd <- as.data.frame(colData(rse_gene))
View(cd)
rownames(rse_gene) <- gsub(rownames(rse_gene),
pattern = "\\..+",
replacement = "")
DESeq data set was created as dds, by setting condition as design. Design means which property you want to compare on your samples. After that a heatmap plotted to visulize Poisson distances between samples. This plot tells us whether this count data will give the answers we need or not. In this case, if there is differential gene expression present between the treated and untreated samples, we expect to see color difference between treated and untreated samples in the heatmap.
I selected to use Poisson distance because the data was in raw form (non-normalized). If it was a normalized dataset, I’d have to use Euclidean distance.
As you can see in the heatmap, color differince present. This means, we can perform differential expression analysis in this dataset.
dds <- DESeqDataSet(rse_gene, design = ~ condition)
poisd <- PoissonDistance(t(counts(dds)))
samplePoisDistMatrix <- as.matrix( poisd$dd)
rownames(samplePoisDistMatrix) <- paste(dds$condition, sep=" - " )
colnames(samplePoisDistMatrix) <- NULL
colors = colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(samplePoisDistMatrix,
clustering_distance_rows = poisd$dd,
clustering_distance_cols = poisd$dd,
col = colors)
When we’re working with DEGs analysis, first thing we need to do is to transform the data. I applied rlog transformation to dds dataset and saved as rld object. Then PCA plot was plotted. The variance in PCA plot indicates how strong the affects such as batch effect, contamination, etc. on the data.
rld <- rlog(dds)
plotPCA(rld, intgroup = c("condition"))
Differential gene expression analysis was performed and results were saved as results. The contrast in the results indicates that I want get the fold change of treated sample over control sample.
dds <- DESeq(dds)
results <- results(dds, contrast = c("condition","untreated","treated"))
MA plot was plotted to examine how the fold change related to mean of each genes that expressed. Notice that the lines near the axis. They indicates that low expressed genes with high fold changes. But they’re not significant. In order to get a more accurate plot for the fold change, we need to get rid of them statistically.
plotMA(results, ylim = c(-5, 5))
So, for this purpose LFC shrink normalization applied and saved as resNorm object. A MA plot was plotted again for normalized data. After that, I saved the resNorm again as data frame into results object.
resNorm <- lfcShrink(dds = dds, res = results, type = "normal", coef = 2)
plotMA(resNorm, ylim = c(-5,5))
results <- as.data.frame(resNorm)
View(results)
After that steps, we need to add gene annotions to the results, and saved the results as anno_results.
anno <- AnnotationDbi::select(org.Hs.eg.db, rownames(results),
columns=c("ENSEMBL", "ENTREZID", "SYMBOL","GENENAME"),
keytype="ENSEMBL")
anno_results <- as.data.frame(results) %>%
rownames_to_column(var = "ENSEMBL") %>%
left_join(anno, by = "ENSEMBL")
Table 1: DEGs analysis results table
datatable(anno_results,options=list(scrollX = TRUE,
scrollY = TRUE),
caption = 'Table 1: Enriched Pathways',
class = "nowrap display")
A volcano plot was plotted. In order to make interpretations to this data, we need to identify significant genes.
EnhancedVolcano(anno_results,
lab = anno_results$SYMBOL,
x = "log2FoldChange",
y="padj",
border = "full",
borderWidth = 1.5,
borderColour = "black",
gridlines.major = FALSE,
gridlines.minor = FALSE,
title = "Untreated versus treated")
We can obtain significant genes by taking the genes that have padj value less than 0.01, log2FoldChange greater than or equal to 1 and, base mean greater than or equal to 20. I saved the results as significant object and printed the head of significant object to examine.
After that we need to order the significant genes and set the get the top DEGs with the helpp of the ids, so we set the ENSEMBL, SYMBOL ids of significant genes as id1 and id2 objects. Then we subset the matrix by id1, which collects DEGs into DEGs objct by taking ids (id1) as base. Then symbols (id2) added as rownames of DEGs. After that top 20 DEGs were obtain by taking 20 rows of the DEGs.
A heatmap was plotted to visualize top 20 differentially expressed genes.
significant <- anno_results[which(anno_results$padj < 0.01 &
abs(anno_results$log2FoldChange) >= 1 &
anno_results$baseMean >= 20), ]
head(significant[order(significant$log2FoldChange ), ] )
## ENSEMBL baseMean log2FoldChange lfcSE stat
## 38558 ENSG00000248939 1900.9167 -7.304877 0.3403203 -12.97524
## 10848 ENSG00000162631 154439.6498 -4.848216 0.2548772 -19.11121
## 51958 ENSG00000271447 10767.3761 -4.418288 0.2503990 -17.69466
## 31440 ENSG00000232896 890.1011 -4.384182 0.3859359 -11.09997
## 6479 ENSG00000131386 53712.9171 -4.341699 0.2237796 -19.43618
## 30238 ENSG00000231156 389.8445 -4.321078 0.3852984 -10.15897
## pvalue padj ENTREZID SYMBOL
## 38558 1.690735e-38 1.199452e-35 <NA> <NA>
## 10848 2.036871e-81 2.203640e-77 22854 NTNG1
## 51958 4.609717e-70 1.994855e-66 79148 MMP28
## 31440 1.254859e-28 4.242501e-26 <NA> <NA>
## 6479 3.815190e-84 8.255118e-80 117248 GALNT15
## 30238 3.022537e-24 6.475263e-22 <NA> <NA>
## GENENAME
## 38558 <NA>
## 10848 netrin G1
## 51958 matrix metallopeptidase 28
## 31440 <NA>
## 6479 polypeptide N-acetylgalactosaminyltransferase 15
## 30238 <NA>
mat <- assay(rld)
orderedSig <- significant[order(significant$padj), ]
id1 <- significant$ENSEMBL
id2 <- significant$SYMBOL
DEGs <- mat[id1,]
rownames(DEGs) <- id2
top20DE <- head(DEGs, n=20)
pheatmap(top20DE,
scale = "row",
clustering_distance_rows = "correlation",
main="Top 20 Differentially Expressed genes")
Table 2: Top 20 DEGs
datatable(top20DE,options=list(scrollX = TRUE,
scrollY = TRUE),
caption = 'Table 2: Top 20 differentially expressed genes (DEGs)',
class = "nowrap display")
To make Gene Set Enrichment Analysis (GSEA), first we need to rank all of the genes. But when we add gsea_metric column, it contains infinite numbers. So to get rid of the infinite numbers, padj column added. It basically changes padj to smallest number of padj anywhere it gets 0. After that, rows with missing values removed and the data was arranged by gsea_metric in descending order. The filtered results saved as resdf object.
resdf <- anno_results %>%
arrange(padj) %>%
mutate(gsea_metric = -log10(padj) * sign(log2FoldChange)) %>%
mutate(padj = case_when(padj == 0 ~ .Machine$double.xmin,
TRUE ~ padj)) %>%
filter(! is.na(gsea_metric)) %>%
arrange(desc(gsea_metric))
Table 3: GSEA Data set
datatable(resdf,options=list(scrollX = TRUE,
scrollY = TRUE),
caption = 'Table 3: GSEA data',
class = "nowrap display")
To run GSEA we need ranked vector of the GSEA datas and gene sets. So GSEA datas were ranked and saved as a vector into the ranks object. Gene sets was obtained from msigdbr package and saved as gene_sets object. And GSEA was performed. The results saved as data frame into gsearesdf object and printed. Dotplot was plotted.
gene_sets <- msigdbr(species = "Homo sapiens", category = "C5")
gene_sets <- gene_sets %>%
dplyr::select(gs_name, gene_symbol)
ranks <- resdf %>%
select(SYMBOL, gsea_metric) %>%
distinct(SYMBOL, .keep_all = TRUE) %>%
deframe()
gseares <- GSEA(geneList = ranks,
TERM2GENE = gene_sets)
gsearesdf <- as.data.frame(gseares)
dotplot(gseares)
So in order to make a plot that nicely understandable top 5 of over-expressed pathways was extracted from the gsearesdf data frame as top_pathways and gseaplot was plotted for each pathway. Plots were saved as GSEA_up.png
top_pathways <- gsearesdf %>%
top_n(n = 5, wt = NES) %>%
pull(ID)
top_pathway_plots <- lapply(top_pathways, function(pathway) {
gseaplot(gseares, geneSetID = pathway, title = pathway)})
top_pathway_plot <- ggarrange(plotlist = top_pathway_plots,
ncol = 3, nrow = 2, labels = "AUTO")
ggsave(top_pathway_plot, filename = "GSEA_up.png",
scale = 1.25,
height = 11, width = 18)
Table 4 : Top 5 over-expressed pathways.
gsearesdf %>%
top_n(n = 5, wt = NES) %>%
datatable(options=list(scrollX = TRUE,
scrollY = TRUE),
caption = 'Table 4: Top 5 over-expressed pathways.',
class = "nowrap display")
Gseaplots for top 5 over-expressed pathways
GSEA_up.png
The same methodology was applied to plot Gseaplots for top 5 under-expressed pathways.
bottom_pathways <- gsearesdf %>%
top_n(n = 5, wt = -NES) %>%
pull(ID)
bottom_pathway_plots <- lapply(bottom_pathways, function(pathway) {
gseaplot(gseares, geneSetID = pathway, title = pathway)})
bottom_pathway_plot <- ggarrange(plotlist = bottom_pathway_plots,
ncol = 3, nrow = 2, labels = "AUTO")
ggsave(bottom_pathway_plot, filename = "GSEA_down.png",
scale = 1.5,
height = 11, width = 18)
Table 5 : Top 5 under-expressed pathways.
gsearesdf %>%
top_n(n = 5, wt = -NES) %>%
datatable(options=list(scrollX = TRUE,
scrollY = TRUE),
caption = 'Table 5: Top 5 under-expressed pathways.',
class = "nowrap display")
Gseaplots for top 5 under-expressed pathways
GSEA_down.png
Table 6: Enriched pathways table
sig_gsea <- gsearesdf %>%
filter(p.adjust <= 0.05) %>%
arrange(desc(NES)) %>%
select(NES, pvalue, p.adjust, core_enrichment) %>%
mutate(pvalue = format(pvalue, digits = 5,
scientific = TRUE),
p.adjust = format(p.adjust, digits = 5,
scientific = TRUE),
NES = signif(NES, 5))
datatable(sig_gsea,options=list(scrollX = TRUE,
scrollY = TRUE),
caption = 'Table 6: Enriched Pathways',
class = "nowrap display")
So we can answer these questions with the results : How does treatment affects the cell behavior?
In the over-expressed pathways, we see “epidermal cell differentiation” and “skin development” pathways. But in under-expressed pathways, these pathways were not present. So we can say that treatment suppressed epidermal cell differentiation and skin development pathways. Also, in the experiment, they say that some of the treated samples show resistance to treatment. We see “mitotic spindle organization” and “nucleosome assembly” in under-expressed pathways (treated). These pathways may be the cause of resistance to treatment. New experiments can be conducted to investigate the affect of these pathways. In addition, we can apply gene enrichment analysis based drug repurposing to target specific pathways enriched in melanoma.